updated: 2021-04-28

Input data overview

The total number of samples and features before and after preprocessing for each taxonomic level.
level n_samples n_features n_features_preproc
phylum 490 19 9
class 490 36 19
order 490 65 28
family 490 124 54
genus 490 316 115
otu 490 20079 705
asv 490 104106 478

Hyperparameter Performance

Plots of HP performance for each model/taxonomic level

Random Forest

The default hyperparameter (mtry) selection for RF is:

  1. sqrt_features / 2
  2. sqrt_features
  3. sqrt_features * 2

Based on previous results, I added mtry=1 for Phylum through Genus levels and mtry=100 for OTU and ASV levels. This set of hyperparameters seems to cover the optimal range for each taxonomic level.
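
As a rough illustration of how the grid above is built, here is a minimal sketch in Python. It is not the mikropml code: the function name `mtry_grid`, the rounding to the nearest integer, and the use of the preprocessed feature counts are all assumptions made for this example.

```python
import math

def mtry_grid(n_features, extra=()):
    """Hypothetical helper: default mtry candidates plus manually added values."""
    root = math.sqrt(n_features)
    defaults = [root / 2, root, root * 2]
    # Rounding to whole numbers (minimum 1) is an assumption for this sketch.
    values = {max(1, round(v)) for v in defaults} | set(extra)
    return sorted(values)

# Preprocessed feature counts taken from the input data table.
genus_grid = mtry_grid(115, extra=[1])    # mtry=1 added for Phylum through Genus
otu_grid = mtry_grid(705, extra=[100])    # mtry=100 added for OTU and ASV

print(genus_grid)
print(otu_grid)
```

Under these assumptions, the added values extend the default grid at the low end for the smaller levels and at the high end for OTU and ASV.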

Logistic Regression

Two hyperparameters: alpha and lambda.
By default, alpha is set to zero (for L2 regularization) and the lambda values are 1e-04, 1e-03, 1e-02, 1e-01, 1e+00, and 1e+01. I added 1e-05 as a lambda value for phylum and class, and higher values of lambda for the remaining levels.
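
The per-level lambda grid can be sketched as below. This is an illustrative Python snippet, not the pipeline code: `lambda_grid` and `extra_by_level` are hypothetical names, and the higher lambda values used for the other levels are not listed in this report, so they are left out rather than guessed.

```python
# Default lambda values as listed above; alpha stays at 0 (L2 regularization).
DEFAULT_LAMBDAS = [1e-04, 1e-03, 1e-02, 1e-01, 1e+00, 1e+01]

def lambda_grid(level, extra_by_level):
    """Hypothetical helper: default lambdas plus any level-specific additions."""
    extras = extra_by_level.get(level, [])
    return sorted(set(DEFAULT_LAMBDAS) | set(extras))

# Only the additions stated in the text are filled in here.
extra_by_level = {"phylum": [1e-05], "class": [1e-05]}

print(lambda_grid("phylum", extra_by_level))
print(lambda_grid("genus", extra_by_level))
```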

Decision Tree

The default hyperparameters were retained for the decision tree (maxdepth = 1, 2, 4, 8, 16, 30). The model does not allow a maxdepth larger than 30.

XGBoost

(needs adjustment)

SVM Radial

Model Performance

DADA2 Comparison

Because some believe that ASVs generated by Mothur are not as good as those generated by DADA2, I ran a comparison. I used DADA2 (v1.18.0) to generate ASVs and ran them through the same mikropml model pipeline. I ran dada(…, pool=T) to capture more rare ASVs and subsampled to XX reads after processing.

Below is a summary table of the number of features at each level. Interestingly, DADA2 identifies only 5,508 ASVs, compared to 104,106 with Mothur. After preprocessing, 630 DADA2 ASVs remain, compared to 478 for Mothur.

level n_samples n_features n_features_preproc
phylum 490 19 9
class 490 36 19
order 490 65 28
family 490 124 54
genus 490 316 115
otu 490 20079 705
asv 490 104106 478
dada2 490 5508 630
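
To make the effect of preprocessing easier to compare across levels, the fraction of features retained can be computed directly from the table above. This is a small Python sketch for illustration only; the variable names are hypothetical and the counts are copied from the table.

```python
# (n_features, n_features_preproc) for each level, copied from the table above.
feature_counts = {
    "phylum": (19, 9),
    "class": (36, 19),
    "order": (65, 28),
    "family": (124, 54),
    "genus": (316, 115),
    "otu": (20079, 705),
    "asv": (104106, 478),
    "dada2": (5508, 630),
}

# Fraction of features that survive preprocessing at each level.
retained = {
    level: after / before for level, (before, after) in feature_counts.items()
}

for level, fraction in retained.items():
    print(f"{level:<7} {fraction:.1%}")
```

By this measure, preprocessing keeps roughly 11% of the DADA2 ASVs but less than 1% of the Mothur ASVs.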

Random Forest Model Performance with Significance

Random Forest on OTU-level data yields the highest median AUC (0.698), followed by Family (0.687), Genus (0.686), and ASV (0.676). While the OTU AUC is not significantly higher than that of the Family or Genus level, it is significantly higher than that of the ASV level.

Level Median AUC
phylum 0.585
class 0.605
order 0.659
family 0.687
genus 0.686
otu 0.698
asv 0.676

Logistic Regression Model Performance with Significance

Level Median AUC
phylum 0.587
class 0.590
order 0.609
family 0.604
genus 0.604
otu 0.616
asv 0.622

Decision Tree Model Performance with Significance

Level Median AUC
phylum 0.585
class 0.569
order 0.577
family 0.604
genus 0.595
otu 0.583
asv 0.566

SVM Radial

Level Median AUC
phylum 0.571
class 0.559
order 0.613
family 0.615
genus 0.616
otu 0.619
asv 0.628

XGBTree

Level Median AUC
phylum 0.608
class 0.623
order 0.657
family 0.668
genus 0.651
otu 0.672
asv 0.648

Feature Importance

The mikropml package includes an option for finding feature importance using a permutation method. This significantly increases the time needed to run the models. The function permutes each model feature (e.g. a genus) and recalculates the AUC with that feature permuted. The outputs are the permuted AUC value and the difference between the actual AUC value and the permuted one. Features with larger positive differences can be interpreted as more important, since the AUC drops further when that feature is permuted.
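
The idea behind the permutation method can be sketched in a few lines of Python. This is a toy illustration, not the mikropml implementation: the data are simulated, the "model" is a fixed scoring function that uses only the first feature, and the names `auc`, `permutation_importance`, and `n_repeats` are assumptions made for this example.

```python
import random

def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return {feature index: (mean permuted AUC, actual AUC - mean permuted AUC)}."""
    rng = random.Random(seed)
    actual = auc(y, [predict(row) for row in X])
    result = {}
    for j in range(len(X[0])):
        permuted_aucs = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted_aucs.append(auc(y, [predict(row) for row in X_perm]))
        mean_permuted = sum(permuted_aucs) / n_repeats
        result[j] = (mean_permuted, actual - mean_permuted)
    return result

# Simulated data: feature 0 carries signal, feature 1 is pure noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[label + data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for label in y]

def predict(row):
    """Stand-in for a trained model: scores samples using feature 0 only."""
    return row[0]

importance = permutation_importance(predict, X, y)
for j, (permuted_auc, difference) in importance.items():
    print(f"feature {j}: permuted AUC = {permuted_auc:.3f}, difference = {difference:.3f}")
```

In this toy setup, permuting the informative feature pulls the AUC toward 0.5 and gives a large positive difference, while permuting the noise feature leaves the AUC unchanged. Because every feature is permuted several times and the model is re-scored each time, the cost grows with the number of features, which is consistent with the longer run times noted above.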

The importance values are all from the Random Forest model.

Top 10 Important features

Rank Phylum Class Order Family Genus OTU ASV
1 Fusobacteria Fusobacteriia Fusobacteriales Clostridiales_Incertae_Sedis_XI Porphyromonas Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Porphyromonas(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiaceae_1(100);Clostridium_sensu_stricto(100);
2 Bacteroidetes Betaproteobacteria Synergistales Bacteroidaceae Bacteroides Bacteria(100);Fusobacteria(100);Fusobacteriia(100);Fusobacteriales(100);Fusobacteriaceae(100);Fusobacterium(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);
3 Synergistetes Negativicutes Bacillales Lachnospiraceae Gemella Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Prevotellaceae(100);Prevotella(100);
4 Actinobacteria Bacteroidia Coriobacteriales Synergistaceae Fusobacterium Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcus(59); Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Porphyromonadaceae(100);Porphyromonas(100);
5 Deinococcus-Thermus Synergistia Burkholderiales Bacillales_Incertae_Sedis_XI Ruminococcus Bacteria(100);Firmicutes(100);Erysipelotrichia(100);Erysipelotrichales(100);Erysipelotrichaceae(100);Coprobacillus(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcaceae_unclassified(100);
6 Verrucomicrobia Firmicutes_unclassified Selenomonadales Coriobacteriaceae Peptoniphilus Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Blautia(100);
7 Firmicutes Deltaproteobacteria Clostridia_unclassified Enterobacteriaceae Anaerostipes Bacteria(100);Firmicutes(100);Bacilli(100);Bacillales(100);Bacillales_Incertae_Sedis_XI(100);Gemella(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);
8 Proteobacteria Verrucomicrobiae Actinomycetales Clostridia_unclassified Clostridium_XlVb Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcaceae_unclassified(100);
9 Bacteria_unclassified Bacteroidetes_unclassified Verrucomicrobiales Desulfovibrionaceae Akkermansia Bacteria(100);Bacteroidetes(100);Bacteroidia(100);Bacteroidales(100);Bacteroidaceae(100);Bacteroides(100); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);Lachnospiraceae_unclassified(100);
10 NA Bacilli Desulfovibrionales Clostridiales_unclassified Pseudoflavonifractor Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcus(70); Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Ruminococcaceae(100);Ruminococcaceae_unclassified(100);